A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

Authors

  • José Ramon Méndez
  • Florentino Fernández Riverola
  • Fernando Díaz
  • Eva Lorenzo Iglesias
  • Juan M. Corchado
Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods have been analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From the experiments carried out, the underlying ideas behind feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
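As a rough illustration of the four selection criteria named in the abstract, the sketch below scores a single term from its 2x2 term/class contingency table. It uses the standard text-categorization definitions, which are an assumption here: the abstract does not say which exact variants the paper uses. The function name `term_scores` and the count names `n11`, `n10`, `n01`, `n00` are hypothetical.

```python
import math


def term_scores(n11, n10, n01, n00):
    """Score one term from its 2x2 contingency table (hypothetical helper).

    n11: spam messages containing the term
    n10: legitimate messages containing the term
    n01: spam messages without the term
    n00: legitimate messages without the term
    """
    n = n11 + n10 + n01 + n00
    with_term = n11 + n10
    without_term = n01 + n00
    spam = n11 + n01
    legit = n10 + n00

    # Document Frequency: number of messages that contain the term.
    df = with_term

    # Chi-squared statistic for a 2x2 table (0.0 if any margin is empty).
    denom = spam * legit * with_term * without_term
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom if denom else 0.0

    def cell(joint, row_total, col_total):
        # One cell's contribution P(t, c) * log2(P(t, c) / (P(t) P(c)));
        # empty cells contribute nothing.
        if joint == 0:
            return 0.0
        return (joint / n) * math.log2(joint * n / (row_total * col_total))

    # Information Gain: sum of the four cell contributions, i.e. the
    # mutual information between term presence and the class label.
    ig = (
        cell(n11, with_term, spam)
        + cell(n10, with_term, legit)
        + cell(n01, without_term, spam)
        + cell(n00, without_term, legit)
    )

    # Pointwise Mutual Information between the term and the spam class.
    mi_spam = math.log2(n11 * n / (with_term * spam)) if n11 else float("-inf")

    return {"df": df, "chi2": chi2, "ig": ig, "mi_spam": mi_spam}
```

In a filter, each vocabulary term would be scored this way and only the top-ranked terms kept as features. A term that appears equally often in both classes scores zero on every criterion except Document Frequency, which ignores the class label entirely.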

Similar resources

Searching for Interacting Features for Spam Filtering

In this paper, we propose a novel feature selection method, INTERACT, to select relevant words of emails for spam email filtering, i.e. classifying an email as spam or legitimate. Four traditional feature selection methods from the text categorization domain, Information Gain, Gain Ratio, Chi Squared, and ReliefF, are also used for performance comparison. Three classifiers, Support Vector Machine (SVM...

A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that harms communications around the world. It is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...

Towards Spam Mail Detection using Robust Feature Evaluated with Feature Selection Techniques

Filtering spam emails is a significant operation in an email system. The efficiency of this process is determined by many factors, such as the number of features, the representation of samples and the classifier. This study covers all these factors and aims to find the optimal settings for email spam filtering. Twelve feature selection methods extensively used in text categorization are implemented to synt...

A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable part of text categorization, also plays an important role in improving spam filtering performance. We propose a new method, named Bi-Test, which utilizes binomial hypothesis testing to estimate whether the probability of a feature belonging to ...

Linger – a Smart Personal Assistant for E-mail Classification

In this paper we present Linger, a neural-network-based system for automated e-mail classification. Two scenarios are explored: filing e-mails into folders and spam e-mail filtering. Extensive experiments indicate that Linger compares favourably to other classification techniques. We study the effects of various feature selection, weighting and normalization methods, and also the portability of ...

Publication date: 2006